Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure metrics for input diagnostics #20933

Merged
merged 5 commits into from
Nov 15, 2024
Merged

Failure metrics for input diagnostics #20933

merged 5 commits into from
Nov 15, 2024

Conversation

patrickmann
Copy link
Contributor

@patrickmann patrickmann commented Nov 11, 2024

/nocl new feature
Relates to issue #20683

Adds meters for every input that experiences processing and/or indexing failures. The meter names are
org.graylog2.<input Id>.failures.indexing
org.graylog2.<input Id>.failures.processing

These metrics are available in Open and Enterprise, independent of configuring the Processing and Indexing Failure stream.

@patrickmann patrickmann marked this pull request as ready for review November 12, 2024 08:13
Copy link
Contributor

@AntonEbel AntonEbel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check could be a problem. I don't know if we need it here but if we do, we need to do the metric stuff before. At least for processing.

@patrickmann
Copy link
Contributor Author

Moved the metric code up into FailureSubmissionService. It now runs ahead of any checks.
Verified that it still updates the metrics with config set to disabled; and when there is no license.

Copy link
Contributor

@AntonEbel AntonEbel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add the metrics after the message.supportsFailureHandling()
method. I've looked at the implementations of this interface method and it looks like messages from the inputs here always return true. Even if we extend the diagnostics page one day to show detailed information from the indexing/processing failure index, hopefully we won't have a discrepancy in the amount of messages then.

@patrickmann
Copy link
Contributor Author

Even if we extend the diagnostics page one day to show detailed information from the indexing/processing failure index, hopefully we won't have a discrepancy in the amount of messages then.

I was thinking it preferable to show a failure count, even if the messages don't support failure handling. At least you have an indication that something is wrong. I get your point of avoiding a discrepancy in counts, but I'd still prefer to see the "true" failure count. If this becomes an issue, we can hopefully label the data appropriately on the UI so it is not confusing to the user.

Copy link
Contributor

@AntonEbel AntonEbel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AntonEbel AntonEbel merged commit ca8452a into master Nov 15, 2024
6 checks passed
@AntonEbel AntonEbel deleted the failureMetrics branch November 15, 2024 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants